Web Scraping

Connect to the RSelenium server from within R; RSelenium automates web browsers, which lets us load and page through the reviews.
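A minimal connection sketch with RSelenium. The port, browser choice, review URL, and CSS selector below are assumptions for illustration, not the exact values used in this analysis:

```r
library(RSelenium)

# Start a Selenium server and browser session (port and browser are assumptions)
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client

# Navigate to the Parasite review page on IMDB (hypothetical URL)
remDr$navigate("https://www.imdb.com/title/tt6751668/reviews")

# Click the "Load More" button to reveal further reviews (selector assumed)
loadMore <- remDr$findElement(using = "css selector",
                              value = ".ipl-load-more__button")
loadMore$clickElement()

# Pull the rendered page source for parsing
pageSource <- remDr$getPageSource()[[1]]
```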

For convenience, I save this dataframe into a CSV file and read it back in.
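The save-and-reload step might look like this (the filename and columns are assumptions; the toy dataframe stands in for the real scraped reviews):

```r
# Hypothetical scraped reviews (stand-in for the real dataframe)
reviews <- data.frame(date = c("2019-10-11", "2019-10-12"),
                      text = c("A masterpiece.", "Best film of the year."),
                      stringsAsFactors = FALSE)

# Cache to CSV so the scraper need not be rerun, then read back in
write.csv(reviews, "parasite_reviews.csv", row.names = FALSE)
reviews <- read.csv("parasite_reviews.csv", stringsAsFactors = FALSE)
```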

Remove the invalid UTF-8 characters from the review text.
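One way to do this in base R, assuming the reviews live in a character vector `text` (a hypothetical name): `iconv` with `sub = ""` drops any byte sequences that are not valid UTF-8.

```r
# Replace any bytes that are not valid UTF-8 with the empty string
clean_utf8 <- function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

# Hypothetical review text; the stray \xfc byte is not valid UTF-8
text <- c("Bong Joon-ho\xfc", "A masterpiece")
text <- clean_utf8(text)
```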

Sentiment Analysis

The term frequency table counts the frequency of each distinct word in the reviews of the movie Parasite, excluding duplicate and uninformative words. Displaying the top 30 terms with the highest frequency suggests a positive emotion among the reviews: words such as best, good, and great all reveal an affirmative attitude toward this movie.
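A base-R sketch of such a count, using a toy stand-in for the cleaned review text and an assumed stopword list:

```r
# Toy stand-in for the cleaned review text
review_text <- c("best movie ever, a great great film",
                 "good story and good acting")
stopwords <- c("a", "and", "the", "ever")  # assumed stopword list

# Lower-case, strip punctuation, split into words, drop stopwords
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", review_text)), "\\s+"))
words <- words[!words %in% stopwords]

# Frequency table, sorted; head(..., 30) gives the top 30 terms
freq <- sort(table(words), decreasing = TRUE)
head(freq, 30)
```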

Word cloud
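A word cloud can be drawn directly from a term-frequency table; a sketch with the wordcloud package (the package choice and the toy frequencies are assumptions):

```r
library(wordcloud)

# Hypothetical term-frequency table (word -> count)
freq <- c(best = 120, good = 95, great = 80, story = 60, class = 55)

# Scale word size by frequency; drop words appearing only once
wordcloud(words = names(freq), freq = freq,
          min.freq = 2, max.words = 100, random.order = FALSE)
```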

Data Cleaning

By aggregating all the reviews of a specific date together, I used the sentiment function to get an overall sentiment score for each day. The table below shows the date, the number of words in the reviews on that date, and the sentiment score of the reviews on that date. On a daily basis, the proportion of negative reviews is 14.16%, while positive reviews make up a much higher 82.30%. This further supports the positive emotion found in the frequency table. It would be sloppy to conclude that the majority of reviews are positive by looking at only a few comments on the first page of IMDB, which is why I scraped all the reviews of Parasite from IMDB to conduct this sentiment analysis.
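One way to produce such daily scores with the sentimentr package (the package choice, dataframe, and column names are assumptions):

```r
library(sentimentr)

# Hypothetical reviews dataframe with a date and text column
reviews <- data.frame(
  date = as.Date(c("2019-10-11", "2019-10-11", "2019-10-12")),
  text = c("Best film of the year.", "A bit slow at times.", "Terrible ending."),
  stringsAsFactors = FALSE
)

# Score sentence sentiment, then aggregate to one score per day
daily <- sentiment_by(get_sentences(reviews$text), by = list(reviews$date))

# Share of negative vs. positive daily scores
mean(daily$ave_sentiment < 0)
mean(daily$ave_sentiment > 0)
```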

Negative and Positive Proportion

## [1] 0.1415929
## [1] 0.8230088

The primary reason for aggregating reviews on a daily basis is to discover a possible relationship between the sentiment in reviews and time. The graph illustrates that the sentiment of reviews stays almost the same, with a slight increase in score recently.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
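The trend plot can be reproduced along these lines with ggplot2 (the dataframe, column names, and simulated scores are assumptions):

```r
library(ggplot2)

# Hypothetical daily sentiment scores
daily_sentiment <- data.frame(
  date  = as.Date("2019-10-11") + 0:29,
  score = cumsum(rnorm(30, mean = 0.01, sd = 0.1))
)

# Points for each day plus a loess trendline, as in the message above
ggplot(daily_sentiment, aes(x = date, y = score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(x = "Date", y = "Sentiment score")
```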

Image Detection

Parasite Cover


## $file
## [1] "/Users/yuangao/Documents/Unstructured/hw3/HW3/para.jpg"
## 
## $type
##        label probability
## 1 comic book  0.23770124
## 2       jean  0.06936955
## 3   web site  0.05744276
## 4    maillot  0.05093752
## 5     jersey  0.03305830

Topic Model

Topic models are great for predicting topic probabilities for unseen documents, and in this case it’s unrealistic for us to manually separate a large number of words into different topics. Therefore, a topic model can help us cluster the reviews into an appropriate number of topics and find how each topic’s percentage changed over time.

Since the prepDocuments function only removes infrequent words, and we would like to avoid the situation where every topic contains the same common words (such as movie), I also removed some of these frequent, duplicated words from the Parasite dataframe.

Use the utility function prepDocuments to process the loaded data and make sure it is in the right format.
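With the stm package, this step looks roughly like the following (the dataframe, column name, and frequency threshold are assumptions):

```r
library(stm)

# Build the corpus: lower-case, strip punctuation, stopwords, and numbers
processed <- textProcessor(parasite$review, metadata = parasite)

# Drop infrequent terms and any documents left with no words
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     lower.thresh = 5)  # threshold is an assumption
```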

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...
## Removing 7319 of 12558 terms (7319 of 53206 tokens) due to frequency 
## Removing 1 Documents with No Words 
## Your corpus now has 175 documents, 5239 terms and 45887 tokens.

The 4 plots below help us find the optimal number of topics: we should select the number of topics with relatively high semantic coherence and low residuals. Therefore, the model with 10 topics is picked here; it has the highest semantic coherence and the lowest residuals among all the numbers of topics I tested. The held-out likelihood, which indicates how well the model predicts unseen words, is also reasonable with 10 topics.
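These diagnostics come from stm's searchK; a sketch, assuming `out` holds the prepDocuments output and the candidate K values are illustrative:

```r
library(stm)

# Compare held-out likelihood, residuals, and semantic coherence
# across several candidate numbers of topics (candidates assumed)
kResult <- searchK(out$documents, out$vocab, K = c(5, 10, 15, 20))
plot(kResult)  # the four diagnostic panels
```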

From this graph, we can tell that topic 8 has the largest expected proportion, followed by topic 3…

It is easy to read the expected topic proportions from the graph above, but the three highest-frequency words in each topic cannot tell a complete story of what each topic is about. Therefore, labelTopics gives us more information by showing the words with the highest probability, the words that are frequent and exclusive to each topic (FREX), Lift, and Score. From the FREX words of each topic, we can assign a label to each topic.

Topic1: Humanity
Topic2: Scene
Topic3: Director
Topic4: Social issue (polarization)
Topic5: Theatre & Viewer
Topic6: Academy
Topic7: Other films
Topic8: Actors
Topic9: Storyline
Topic10: Expectation
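The table below comes from stm's labelTopics; a sketch of the calls, assuming `out` holds the prepDocuments output and `parasiteFit` is the name of the fitted model:

```r
library(stm)

# Fit the 10-topic model, then label each topic four ways:
# Highest Prob, FREX, Lift, and Score (seven words each)
parasiteFit <- stm(out$documents, out$vocab, K = 10, data = out$meta)
labelTopics(parasiteFit, n = 7)
```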

## Topic 1 Top Words:
##       Highest Prob: well, story, class, bong, kim, park, best 
##       FREX: spent, human, taek, deeply, looks, body, ground 
##       Lift: desires, gifted, spent, superiority, transformation, humiliation, taek 
##       Score: spent, kitaeks, chose, disease, taek, slave, body 
## Topic 2 Top Words:
##       Highest Prob: rich, poor, like, class, good, way, dont 
##       FREX: rain, window, smell, underground, murder, play, birthday 
##       Lift: audiencethe, casts, disparate, doted, ducked, faced, fridge 
##       Score: tent, rain, dude, traditional, disparate, fridge, inferiority 
## Topic 3 Top Words:
##       Highest Prob: bong, think, best, poor, rich, like, great 
##       FREX: think, planning, themes, review, layers, original, want 
##       Lift: blink, comedydrama, excellently, junho, obtaining, undertones, welloverall 
##       Score: junho, planning, masters, organism, jokes, kitaeks, paper 
## Topic 4 Top Words:
##       Highest Prob: story, poor, end, rich, people, like, good 
##       FREX: played, festival, apartment, end, yeonkyo, humor, evil 
##       Lift: blur, casualties, composition, dysfunctional, elegance, empathize, focusing 
##       Score: heaven, clues, views, blur, sydney, roll, casualties 
## Topic 5 Top Words:
##       Highest Prob: like, story, still, time, people, feel, dont 
##       FREX: theatre, sat, still, dread, truly, watching, level 
##       Lift: utter, adaptation, amazement, attend, attended, betrayed, bricks 
##       Score: eager, sat, utter, dread, amazement, fullblown, stance 
## Topic 6 Top Words:
##       Highest Prob: best, like, korean, story, good, films, way 
##       FREX: academy, picture, surely, set, nominees, mean, oscars 
##       Lift: accurate, believability, blank, competitive, conartist, downright, lucrative 
##       Score: plain, nominees, multilayered, interior, instance, architecture, stage 
## Topic 7 Top Words:
##       Highest Prob: best, like, good, poor, great, well, rich 
##       FREX: oscars, koreans, suspenseful, parts, joker, oscar, picture 
##       Lift: gala, movieit, slightest, anchor, bravo, carried, confess 
##       Score: movieit, koreans, symbolizes, native, wellmade, gala, slightest 
## Topic 8 Top Words:
##       Highest Prob: good, like, best, people, story, way, poor 
##       FREX: writing, husband, creativity, con, opinion, enjoyed, dramatic 
##       Lift: hitchcockian, restrained, bubble, dilemma, february, finest, goodthe 
##       Score: shaped, finest, creativity, slip, sub, hitchcockian, superficial 
## Topic 9 Top Words:
##       Highest Prob: park, best, korean, bong, story, kim, like 
##       FREX: mrs, geunsae, kijeong, park, minhyuk, jang, song 
##       Lift: absorbing, advantages, amenities, architects, artistry, backfires, bled 
##       Score: geunsae, mrs, taught, minhyuk, kijeong, replace, convinces 
## Topic 10 Top Words:
##       Highest Prob: like, bong, well, class, kims, park, parks 
##       FREX: expecting, kims, parks, inequality, part, worth, kiwoo 
##       Lift: proven, allthe, captivated, controlled, epilogue, esp, expecting 
##       Score: expecting, relation, popular, relatively, thematic, proven, graphic

Repeat the text processing procedures on the test data.

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...

Align the old vocabulary of the 10 topics to our holdout reviews.
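stm's alignCorpus handles this alignment; a sketch with assumed object names (`holdout` for the test reviews, `parasiteFit` for the fitted model):

```r
library(stm)

# Process the holdout reviews, then restrict them to the vocabulary
# of the previously fitted model
newProcessed <- textProcessor(holdout$review, metadata = holdout)
aligned <- alignCorpus(new = newProcessed, old.vocab = parasiteFit$vocab)
```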

## Your new corpus now has 50 documents, 3301 non-zero terms of 5996 total terms in the original set. 
## 2695 terms from the new data did not match.
## This means the new data contained 63.0% of the old terms
## and the old data contained 55.1% of the unique terms in the new data. 
## You have retained 18843 tokens of the 21880 tokens you started with (86.1%).

Predicting thetas for the holdout documents based on the previously fitted model.
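Theta prediction uses stm's fitNewDocuments; sketched here with assumed object names (`parasiteFit` for the fitted model, `aligned` for the alignCorpus output):

```r
library(stm)

# Estimate topic proportions (thetas) for the aligned holdout documents
# under the already-fitted 10-topic model
newFit <- fitNewDocuments(model = parasiteFit,
                          documents = aligned$documents,
                          origData = aligned$meta)
head(newFit$theta)  # one row per holdout document, one column per topic
```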

## ..................................................

Estimate the effect of the topical prevalence covariate Year.

## 
## Call:
## estimateEffect(formula = 1:10 ~ Year, stmobj = topicPredictor, 
##     metadata = textPrep$meta)
## 
## 
## Topic 1:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.82424   94.39244   1.386    0.168
## Year         -0.06474    0.04675  -1.385    0.168
## 
## 
## Topic 2:
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.28872   88.81219   1.017    0.311
## Year        -0.04468    0.04398  -1.016    0.311
## 
## 
## Topic 3:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  25.96617  107.46722   0.242    0.809
## Year         -0.01280    0.05322  -0.240    0.810
## 
## 
## Topic 4:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  12.448726 106.963797   0.116    0.907
## Year         -0.006105   0.052971  -0.115    0.908
## 
## 
## Topic 5:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -145.37812   99.75809  -1.457    0.147
## Year           0.07204    0.04940   1.458    0.147
## 
## 
## Topic 6:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.84106   90.73958  -0.119    0.905
## Year          0.00541    0.04494   0.120    0.904
## 
## 
## Topic 7:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.751858 108.276008  -0.173    0.863
## Year          0.009342   0.053622   0.174    0.862
## 
## 
## Topic 8:
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  31.04295  111.15252   0.279    0.780
## Year         -0.01531    0.05505  -0.278    0.781
## 
## 
## Topic 9:
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.83015   95.19982   0.376    0.707
## Year        -0.01770    0.04715  -0.375    0.708
## 
## 
## Topic 10:
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -149.00514   84.20059   -1.77   0.0785 .
## Year           0.07383    0.04170    1.77   0.0784 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The proportion of words in Topic1: Humanity decreases over time.

The proportion of words in Topic2: Scene decreases over time.

The proportion of words in Topic3: Director remains constant over time.

The proportion of words in Topic4: Social issue (polarization) remains constant over time.

The proportion of words in Topic5: Theatre & Viewer increases over time.

The proportion of words in Topic6: Academy remains constant over time.

The proportion of words in Topic7: Other films slightly increases over time.

The proportion of words in Topic8: Actors slightly decreases over time.

The proportion of words in Topic9: Storyline slightly decreases over time.

The proportion of words in Topic10: Expectation increases over time.

From the trendline of the proportion of words in each topic over time, we can tell that there were more reviews about the storyline, the actors, and the scenes just after the movie was released. However, as time went by, there were fewer comments on the content of the movie, and more reviews about expectations and the director.
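These per-topic trendlines can be drawn by plotting the estimateEffect output; a sketch, assuming the result of the estimateEffect call above is stored in `prep`:

```r
library(stm)

# One smoothed line per topic, showing expected topic proportion
# as a function of the continuous covariate Year
plot(prep, covariate = "Year", method = "continuous", topics = 1:10,
     printlegend = FALSE, xlab = "Year")
```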

 

A work by Karen